A resource-light approach to morpho-syntactic tagging. Anna Feldman and Jirka Hana

Author

  • Lieve Macken
Abstract

Anna Feldman and Jirka Hana had a problem. Wanting to extract Russian verb frames, they lacked a tool for the necessary first step: morphological analysis of Russian words, disambiguated for context. To avoid the significant overhead of building a contextualized morphological analyzer from scratch, Feldman and Hana wondered whether an analyzer that was already available for Czech would perform adequately on Russian. This book is the culmination of five years' research on projecting to a target language a contextualized morphological analyzer that was built for a separate source language, when both source and target belong to the same language family (Slavic, Romance, etc.). The authors succeed at building competitive morphological analysis systems for the target languages they consider (Russian, Catalan, and Portuguese), while expending a minimum of effort to construct specialized resources for these targets. At the culmination of their book, in Chapter 7, Feldman and Hana report an absolute improvement of about 6 percentage points (79.7% vs. 73.5% labeling accuracy) when using a Czech morphological analyzer projected to Russian as opposed to training a statistical analyzer directly on a small sample (1,758 words) of hand-annotated Russian. Unfortunately missing is a formal demonstration that hand-labeling 1,758 words with morphological analyses requires human effort equivalent to that of projecting an analyzer from one language to another. The authors' final morphological projection incorporates a variety of improvements that require human intervention: from a hand-built morphological guesser on the target language side, to hand-defined rules that identify cognates between source and target languages and that render the syntactic structure of the source language more similar to the target's structure. Nowhere do the authors report the person-hours required to build each of these components, and the reader is left to trust that constructing the projected systems takes as little time as is implied.
A word of warning to those with a linguistics background: The authors prefer the language of natural language processing (NLP) to standard linguistic terminology. As a prime example, the title of this book includes the phrase morpho-syntactic tagging, a term from NLP. Part-of-speech tagging, in languages with little inflectional morphology, such as English, involves assigning to each word one part-of-speech tag from a small set of 50 or so possible tags. For the more inflected Slavic and Romance languages considered in this book, the tag sets include as many as 4,000 tags, each marking a full suite of morpho-syntactic features, such as tense, case, or number. …
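To make the contrast in tag-set size concrete, here is a minimal sketch of how a morpho-syntactic tag can be viewed as a bundle of feature values, and of how labeling accuracy is typically scored by exact match on the full tag. The feature inventory below is hypothetical and chosen only for illustration; it is not the tag set used in the book, and the names `FEATURES`, `all_tags`, and `accuracy` are assumptions of this sketch.

```python
from itertools import product

# Hypothetical feature inventory, for illustration only (not the book's tag set).
FEATURES = {
    "pos": ["noun", "adjective"],
    "case": ["nom", "gen", "dat", "acc", "loc", "ins"],
    "number": ["sg", "pl"],
    "gender": ["masc", "fem", "neut"],
}


def all_tags(features):
    """Enumerate every combination of feature values as a dict-based tag."""
    names = list(features)
    return [dict(zip(names, values)) for values in product(*features.values())]


def accuracy(predicted, gold):
    """Fraction of tokens whose full tag matches the gold tag exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


TAGS = all_tags(FEATURES)
print(len(TAGS))  # 2 * 6 * 2 * 3 = 72 combinations from four small features
```

Even this toy inventory of four features yields 72 tags, which suggests how a full suite of features can push a tag set into the thousands. Under exact-match scoring, a prediction with one wrong feature counts as entirely wrong.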

Similar resources

Challenges of Cheap Resource Creation

We describe the challenges of resource creation for a resource-light system for morphological tagging of fusional languages (Feldman and Hana, 2010). The constraints on resources (time, expertise, and money) introduce challenges that are not present in development of morphological tools and corpora in the usual, resource intensive way.

A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources

We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languag...

Knowledge- and Labor-light Morphological Analysis

We describe a knowledge and labor-light system for morphological analysis of fusional languages, exemplified by analysis of Czech. Our approach takes the middle road between completely unsupervised systems on the one hand and systems with extensive manually-created resources on the other. For the majority of languages and applications neither of these extreme approaches seems warranted. The kno...

A low-budget tagger for Old Czech

The paper describes a tagger for Old Czech (1200-1500 AD), a fusional language with rich morphology. The practical restrictions (no native speakers, limited corpora and lexicons, limited funding) make Old Czech an ideal candidate for a resource-light crosslingual method that we have been developing (e.g. Hana et al., 2004; Feldman and Hana, 2010). We use a traditional supervised tagger. However...

Resource-Light Approaches to Computational Morphology Part 1: Monolingual Approaches

This article surveys resource-light monolingual approaches to morphological analysis and tagging. While supervised analyzers and taggers are very accurate, they are extremely expensive to create. Therefore, most of the world languages and dialects have no realistic prospect for morphological tools created in this way. The weakly-supervised approaches aim to minimize time, expertise and/or finan...

Journal title:
  • LLC

Volume 25  Issue 

Pages  -

Publication date 2010